331 Final Project

Author

Hallie Christopherson, Meyli Jaeger, Tyler Luby Howard, and Spruha Nayak

PC3: Joining Data

The Data

Total Health Spending per person (International $)
Source: https://www.fao.org/faostat/en/#home

Shows the average health expenditure per person, expressed in international dollars using PPP (purchasing power parity)

Sugar per person (g per day)
Source: https://www.who.int/gho/en/

Quantity of food consumption of sugars and sweeteners (g per person per day) 2004 data is a rough extrapolation

Set Up and Exploration

# Load libraries
library(tidyverse)
library(janitor)
library(knitr)
library(gganimate)
library(gifski)
# Load datasets
sugar <- read_csv('./sugar.csv')
spending <- read_csv('./spending.csv')
# Look at how many columns and rows
dim(sugar)
dim(spending)

# See unique values of countries
unique(sugar$country)[1:10]
unique(spending$country)[1:10]

# See first 6 rows of each dataset
sugar |>
  head() |>
  kable()

spending |>
  head() |>
  kable()

Cleaning and Joining Datasets

# Clean spending data
spending_long <- spending |>
  mutate(country = trimws(country)) |>
  clean_names() |>
  pivot_longer(-country, names_to = "year", values_to = "spending") |>
  mutate(year = as.integer(str_extract(year, "\\d{4}")))

spending_long <- spending_long |>
  distinct(country, year, .keep_all = TRUE) |>
  filter(!is.na(country) & !is.na(year))

# Clean sugar data
sugar_long <- sugar |>
  mutate(country = trimws(country)) |>
  clean_names() |>
  pivot_longer(-country, names_to = "year", values_to = "sugar") |>
  mutate(year = as.integer(str_extract(year, "\\d{4}")))

sugar_long <- sugar_long |>
  distinct(country, year, .keep_all = TRUE) |>
  filter(!is.na(country) & !is.na(year))

# Confirm all values are numeric 
sugar_long <- sugar_long |>
  mutate(sugar = as.numeric(sugar))
spending_long <- spending_long |>
  mutate(spending = as.numeric(spending))

str(sugar_long$sugar)
 num [1:10382] 14.4 12.8 13.2 15.5 16.3 16.7 17.4 17.8 14.5 18.8 ...
str(spending_long$spending)
 num [1:3040] NA NA NA NA NA NA NA 14.8 18.3 20.7 ...
# Join and remove rows with both values missing
sugar_spending <- full_join(spending_long, sugar_long, by = c("country", "year")) |>
  filter(!(is.na(spending) & is.na(sugar)))

New Joined Dataset

# Check the joined dataset
glimpse(sugar_spending)
Rows: 9,781
Columns: 4
$ country  <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "…
$ year     <int> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2…
$ spending <dbl> NA, NA, NA, NA, NA, NA, NA, 14.80, 18.30, 20.70, 21.90, 23.80…
$ sugar    <dbl> 8.44, 7.23, 7.86, 8.33, 8.90, 8.74, 8.41, 9.45, 12.20, 15.50,…
summary(sugar_spending)
   country               year         spending           sugar       
 Length:9781        Min.   :1961   Min.   :   2.02   Min.   :   0.0  
 Class :character   1st Qu.:1977   1st Qu.:  39.77   1st Qu.:  40.4  
 Mode  :character   Median :1993   Median : 156.00   Median :  81.5  
                    Mean   :1991   Mean   : 653.41   Mean   :  80.6  
                    3rd Qu.:2005   3rd Qu.: 564.00   3rd Qu.: 113.0  
                    Max.   :2018   Max.   :8360.00   Max.   :1410.0  
                                   NA's   :6773      NA's   :423     
sugar_spending |>
  head() |>
  kable()
country year spending sugar
Afghanistan 1995 NA 8.44
Afghanistan 1996 NA 7.23
Afghanistan 1997 NA 7.86
Afghanistan 1998 NA 8.33
Afghanistan 1999 NA 8.90
Afghanistan 2000 NA 8.74

Write-up

For this analysis, we are exploring two variables integral to understanding the evolution of global nutritional health and healthcare spending. The first is the number of sugars and sweeteners per person (measured in g per day). This data originates from the United Nations’ Food and Agriculture Organization’s FAOSTAT database, and is compiled by Gapminder. Data from 2004 was missing, therefore, 2004 is a rough extrapolation of values calculated by Gapminder.

The second variable is the total health spending per person as measured in international dollars, represented using purchasing power parity (PPP), a currency conversion rate that equalizes different currencies by removing differences in price levels amongst countries. This data comes from the World Health Organization’s Global Health Expenditure Database (GHED).

We hypothesize that these two variables are strongly related and that increases in sugar consumption result in rising health spending worldwide per person. An article from UC Berkeley Public Health written by Berthold (2023) supports this, explaining that a local soda tax in Oakland, CA resulted in a 26.8% drop in the purchase of sugar-sweetened beverages. We are exploring whether preventing diseases associated with these sugary beverages (diabetes, heart disease, stroke, gum disease) reduces health care costs, and are extrapolating this pattern to a global sphere.

To prepare the data for analysis, we combined two datasets: one reporting average sugar consumption per person per day and the other detailing health care spending per person, both measured by country and year. Before merging, we ensured consistency in country names and removed any observations lacking year or country information. Each dataset originally contained a wide format with multiple year columns; we reshaped them so that each row represented a single country-year observation. We identified and removed duplicate records for the same country and year to avoid overrepresentation. The final cleaned dataset contains 9,781 rows, incorporating all meaningful observations while excluding any instance where both sugar and spending values are missing. The original datasets had 179 and 190 rows, respectively, reflecting how the reshaping and merging process expanded the data based on multiple years per country.

References

Berthold, J. (2023, April 21). Sugary drink tax improves health, lowers health care costs: Sweetened beverage purchases drop 27% in Oakland, signaling potential impact of national legislation. University of California, Berkeley. Sugary drink tax improves health, lowers health care costs

Food and Agriculture Organization. (2024). FAOSTAT: Sugar & sweeteners food supply data. http://data.un.org/Data.aspx?q=Sugar&d=FAO&f=itemCode:2909

World Health Organization. (2024). Global Health Expenditure Database. https://apps.who.int/nha/database

PC4: Joining Data

Data Visualization

Static Visualization: Sugar Consumption vs. Health Spending by Country

# Average values per country (over all years)
country_summary <- sugar_spending |>
  group_by(country) |>
  summarize(avg_sugar = mean(sugar, na.rm = TRUE),
            avg_spending = mean(spending, na.rm = TRUE)) |>
  drop_na()

# Scatterplot with regression line
ggplot(country_summary, aes(x = avg_sugar, y = avg_spending)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Average Health Spending vs. Sugar Consumption by Country",
       x = "Average Daily Sugar Consumption (g per person)",
       y = "Average Health Spending (Intl $ per person)")

The scatterplot above displays the relationship between average daily sugar consumption (in grams) and average health spending per person (in international dollars), using country-level data averaged across multiple years. Each point represents one country. The upward-sloping regression line suggests that, in general, countries with higher sugar consumption tend to spend more on health care per person. However, the wide spread of the points around the line indicates that other factors beyond sugar intake likely influence health spending as well.

Animated Visualization: Sugar Consumption vs. Health Spending Over Time

sugar_spending_clean <- sugar_spending |>
  filter(!is.na(sugar) & !is.na(spending))

anim_plot <- ggplot(sugar_spending_clean, aes(x = sugar, y = spending)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Health Spending vs Sugar Consumption Over Time",
       subtitle = "Year: {frame_time}",
       x = "Sugar Consumption (g/person/day)",
       y = "Health Spending (Intl $/person)") +
  transition_time(year) +
  ease_aes('linear')

animate(anim_plot, nframes = 100, fps = 10, renderer = gifski_renderer())

This animation shows how the relationship between sugar consumption (g/person/day) and health spending (Intl $/person) has changed across countries from 1961 to 2018. Each point represents a country in a given year, and the red line shows the trend for that year using linear regression.

Overall, we see that as sugar consumption increases, health spending also tends to rise, though the strength of this relationship varies over time. This helps illustrate how global patterns in diet may relate to healthcare costs.

Linear Model

To further examine the relationship between sugar consumption and health spending, we fit a linear regression model to our data, specifically using country-level averages. Meaning, we summarized our data set by averaging sugar consumption and health spending across all years by country. We decided to do this since it stabilizes our trendlines since it abstracts away year to year fluctuations while still incorporating a long period of data collection.

lm_model <- lm(avg_spending ~ avg_sugar, data = country_summary)
model_summary <- summary(lm_model)
model_summary
coefficients <- coef(model_summary)
coefficients_df <- data.frame(
  Estimate = round(coefficients[, "Estimate"], 3),
  Std_Error = round(coefficients[, "Std. Error"], 3),
  t_value = round(coefficients[, "t value"], 3),
  p_value = signif(coefficients[, "Pr(>|t|)"], 3)
)

kable(coefficients_df, caption = "Linear Regression Coefficients")
Linear Regression Coefficients
Estimate Std_Error t_value p_value
(Intercept) -424.561 162.479 -2.613 0.00981
avg_sugar 12.853 1.715 7.495 0.00000